R Introduction Workshop
March 15, 2018
Find the materials for this presentation here:
github.com/karaesmen/WIA2018_data_cleaning
You can find out more about R-Ladies:
Meetup.com: meetup.com/RLadies-Columbus/
Twitter: @RLadiesColumbus
What is data science? A buzzword with a blurry definition…
What is big data?
Another buzzword with a blurry definition?
A petabyte is a million gigabytes (GB). (datascience@berkeley)
A better description:
“In traditional analysis, the development of a statistical model takes more time than the calculation by the computer. When it comes to Big Data this proportion is turned upside down. Big Data comes into play when the CPU time for the calculation takes longer than the cognitive process of designing a model.” - Hadley Wickham
R can easily handle data sizes up to hundreds of megabytes, and additional R packages can help with tens of gigabytes, or with even larger sizes depending on the “big data problem”.
Is your “big data” actually big data?
Small data in disguise:
Although the complete data might be big, the data needed to answer the question of interest can be small enough to fit in memory.
A large number of small data :
Although the data might be big, it might comprise many small data problems. Each individual problem fits in memory, but you have millions of them. In that case systems like Hadoop or Spark can help, and R packages like sparklyr, rhipe, and ddr make this process easy.
R offers great tools for the entire Data Science process
Or interactive plots:
library(plotly)
p <- iris %>%
ggplot(aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point() +
geom_smooth()
ggplotly(p)
R community: #rstats on Twitter
tidyverse: a neat universe of data tools.
Online book: R for Data Science
In most cases, data doesn’t come clean (pun intended): it requires cleaning before any modelling or visualization.
“Happy families are all alike;
every unhappy family is unhappy in its own way.”
–– Leo Tolstoy
“Tidy datasets are all alike,
but every messy dataset is messy in its own way.”
–– Hadley Wickham
Luckily, there is a nice set of packages offering help with messy datasets.
Most R tools (such as lm and ggplot) expect a data.frame as input. Consider this messy data.frame:

| | Pregnant | Not.Pregnant |
|---|---|---|
| Male | 0 | 5 |
| Female | 1 | 4 |
The tidy version has three variables, sex, pregnant and n:

| sex | pregnant | n |
|---|---|---|
| male | no | 5 |
| male | yes | 0 |
| female | no | 4 |
| female | yes | 1 |
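As a quick sketch, the tidy table above can be built by rows with tribble() from the tibble package, and tidy form then makes questions easy to express with dplyr:

```r
library(tibble)
library(dplyr)

# the tidy table from above, built by rows
preg <- tribble(
  ~sex,     ~pregnant, ~n,
  "male",   "no",      5,
  "male",   "yes",     0,
  "female", "no",      4,
  "female", "yes",     1
)

# tidy form makes questions easy, e.g. totals by sex
preg %>% group_by(sex) %>% summarise(total = sum(n))
```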
Real datasets usually come with the following problems that need tidying:
Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.
Now, let’s look at some tools from the tidyverse that can help with these issues, while going over some of the examples from the dplyr and tidyr vignettes.
browseVignettes(package = "dplyr")
browseVignettes(package = "tidyr")
Example: the pew dataset (pew.csv).
library(tidyverse)
pew <- read_csv("data/pew.csv")
head(pew)
| religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k | Don’t know/refused |
|---|---|---|---|---|---|---|---|---|---|---|
| Agnostic | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 | 96 |
| Atheist | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 | 76 |
| Buddhist | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 | 54 |
| Catholic | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 | 1489 |
| Don’t know/refused | 15 | 14 | 15 | 11 | 10 | 35 | 21 | 17 | 18 | 116 |
| Evangelical Prot | 575 | 869 | 1064 | 982 | 881 | 1486 | 949 | 723 | 414 | 1529 |
To tidy this dataset, we need three variables: religion, income and frequency. This action is often described as making a wide dataset long (or tall), also called “melting”.
We can do this with the gather() function, which has 4 main arguments:
- data: the data frame to gather.
- key: the name of the key column, which will contain the collapsed column names. Here the key column would be income.
- value: the name of the value column, which will be populated with the gathered values. Here it’s n.
- the columns to gather: bare column names of the columns to collapse. Here it’s every column except religion.

gather(data = pew, key = income, value = n, -religion)

If we use the pipe %>%, the same code looks like this:
pew %>%
gather(key=income, value=n, -religion) %>%
head
| religion | income | n |
|---|---|---|
| Agnostic | <$10k | 27 |
| Atheist | <$10k | 12 |
| Buddhist | <$10k | 27 |
| Catholic | <$10k | 418 |
| Don’t know/refused | <$10k | 15 |
| Evangelical Prot | <$10k | 575 |
This form is tidy because each column represents a variable and each row represents an observation, which in this case is a demographic unit corresponding to a combination of religion and income.
Example: billboard. This format is also used to record regularly spaced observations over time.
The file billboard.csv has variables artist, track, date.entered, rank and week; the rank in each week is recorded across the columns wk1 to wk76.

billboard <- read_csv("data/billboard.csv")
billboard[1:3, 1:10]
| year | artist | track | time | date.entered | wk1 | wk2 | wk3 | wk4 | wk5 |
|---|---|---|---|---|---|---|---|---|---|
| 2000 | 2 Pac | Baby Don’t Cry (Keep… | 04:22:00 | 2000-02-26 | 87 | 82 | 72 | 77 | 87 |
| 2000 | 2Ge+her | The Hardest Part Of … | 03:15:00 | 2000-09-02 | 91 | 87 | 92 | NA | NA |
| 2000 | 3 Doors Down | Kryptonite | 03:53:00 | 2000-04-08 | 81 | 70 | 68 | 67 | 66 |
Again, to tidy this dataset, we first gather together all the wk columns. The column names give the week and the values are the ranks:
billboard2 <- billboard %>%
gather(week, rank, wk1:wk76, na.rm = TRUE)
billboard2 %>% head
| year | artist | track | time | date.entered | week | rank |
|---|---|---|---|---|---|---|
| 2000 | 2 Pac | Baby Don’t Cry (Keep… | 04:22:00 | 2000-02-26 | wk1 | 87 |
| 2000 | 2Ge+her | The Hardest Part Of … | 03:15:00 | 2000-09-02 | wk1 | 91 |
| 2000 | 3 Doors Down | Kryptonite | 03:53:00 | 2000-04-08 | wk1 | 81 |
| 2000 | 3 Doors Down | Loser | 04:24:00 | 2000-10-21 | wk1 | 76 |
| 2000 | 504 Boyz | Wobble Wobble | 03:35:00 | 2000-04-15 | wk1 | 57 |
| 2000 | 98^0 | Give Me Just One Nig… | 03:24:00 | 2000-08-19 | wk1 | 51 |
Here we use na.rm = TRUE to drop missing values from the gathered columns. In this data, missing values represent weeks when the song wasn’t on the charts, so they can be safely dropped.
In this case it’s also nice to do a little cleaning, converting the week variable to a number, and figuring out the date corresponding to each week on the charts:
billboard3 <- billboard2 %>%
mutate(
week = parse_number(week),
date = as.Date(date.entered) + 7 * (week - 1)) %>%
select(-date.entered)
billboard3 %>% head
| year | artist | track | time | week | rank | date |
|---|---|---|---|---|---|---|
| 2000 | 2 Pac | Baby Don’t Cry (Keep… | 04:22:00 | 1 | 87 | 2000-02-26 |
| 2000 | 2Ge+her | The Hardest Part Of … | 03:15:00 | 1 | 91 | 2000-09-02 |
| 2000 | 3 Doors Down | Kryptonite | 03:53:00 | 1 | 81 | 2000-04-08 |
| 2000 | 3 Doors Down | Loser | 04:24:00 | 1 | 76 | 2000-10-21 |
| 2000 | 504 Boyz | Wobble Wobble | 03:35:00 | 1 | 57 | 2000-04-15 |
| 2000 | 98^0 | Give Me Just One Nig… | 03:24:00 | 1 | 51 | 2000-08-19 |
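The two transformations above are small enough to check in isolation; a minimal sketch:

```r
library(readr)

# parse_number() drops the non-numeric characters around a number
parse_number("wk10")                 # 10

# the chart date: entry date plus 7 days for each elapsed week
as.Date("2000-02-26") + 7 * (3 - 1)  # week 3 -> "2000-03-11"
```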
Finally, it’s always a good idea to sort the data. We could do it by artist, track and week:
billboard3 %>% arrange(artist, track, week)

Or by date and rank:
billboard3 %>% arrange(date, rank) %>% head
| year | artist | track | time | week | rank | date |
|---|---|---|---|---|---|---|
| 2000 | Lonestar | Amazed | 04:25:00 | 1 | 81 | 1999-06-05 |
| 2000 | Lonestar | Amazed | 04:25:00 | 2 | 54 | 1999-06-12 |
| 2000 | Lonestar | Amazed | 04:25:00 | 3 | 44 | 1999-06-19 |
| 2000 | Lonestar | Amazed | 04:25:00 | 4 | 39 | 1999-06-26 |
| 2000 | Lonestar | Amazed | 04:25:00 | 5 | 38 | 1999-07-03 |
| 2000 | Lonestar | Amazed | 04:25:00 | 6 | 33 | 1999-07-10 |
After gathering columns, the key column is sometimes a combination of multiple underlying variable names.
- Example dataset: tb (tuberculosis).
- From the World Health Organisation; it records the counts of confirmed tuberculosis cases by country, year, and demographic group.
- The demographic groups are broken down by sex (m, f) and age (0-14, 15-24, 25-34, 35-44, 45-54, 55-64, unknown).
tb <- read_csv("data/tb.csv")
tb[1:3, 1:10]
| iso2 | year | m04 | m514 | m014 | m1524 | m2534 | m3544 | m4554 | m5564 |
|---|---|---|---|---|---|---|---|---|---|
| AD | 1989 | NA | NA | NA | NA | NA | NA | NA | NA |
| AD | 1990 | NA | NA | NA | NA | NA | NA | NA | NA |
| AD | 1991 | NA | NA | NA | NA | NA | NA | NA | NA |
First we gather up the non-variable columns:
tb2 <- tb %>%
gather(key = demo, value = n, -iso2, -year, na.rm = TRUE)
tb2 %>% head
| iso2 | year | demo | n |
|---|---|---|---|
| AD | 2005 | m04 | 0 |
| AD | 2006 | m04 | 0 |
| AD | 2008 | m04 | 0 |
| AE | 2006 | m04 | 0 |
| AE | 2007 | m04 | 0 |
| AE | 2008 | m04 | 0 |
Column headers in this format are often separated by a non-alphanumeric character (e.g. ., -, _, :) or have a fixed-width format, as in this dataset. separate() makes it easy to split a compound variable into individual variables. You can either pass it a regular expression to split on (the default is to split on non-alphanumeric characters) or a vector of character positions. In this case we want to split after the first character:
tb3 <- tb2 %>%
separate(demo, c("sex", "age"), 1)
tb3 %>% head
| iso2 | year | sex | age | n |
|---|---|---|---|---|
| AD | 2005 | m | 04 | 0 |
| AD | 2006 | m | 04 | 0 |
| AD | 2008 | m | 04 | 0 |
| AE | 2006 | m | 04 | 0 |
| AE | 2007 | m | 04 | 0 |
| AE | 2008 | m | 04 | 0 |
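separate() also handles delimiter-based splitting; a minimal sketch with a made-up column that uses - as the separator:

```r
library(tidyr)
library(tibble)

df <- tibble(demo = c("m-04", "f-1524"))

# sep can be a regular expression instead of a character position
separate(df, demo, into = c("sex", "age"), sep = "-")
```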
Storing the values in this form resolves a problem in the original data. We want to compare rates, not counts, which means we need to know the population.
In the original format, there is no easy way to add a population variable.
It has to be stored in a separate table, which makes it hard to correctly match populations to counts.
In tidy form, adding variables for population and rate is easy because they’re just additional columns.
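A minimal sketch of that last point, using a made-up population table (the numbers are hypothetical):

```r
library(dplyr)
library(tibble)

counts <- tibble(sex = c("m", "m"), age = c("04", "1524"), n = c(2, 10))

# hypothetical population sizes for the same demographic groups
pop <- tibble(sex = c("m", "m"), age = c("04", "1524"),
              population = c(1000, 5000))

# in tidy form, the rate is just one more column
counts %>%
  left_join(pop, by = c("sex", "age")) %>%
  mutate(rate = n / population)
```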
The most complicated form of messy data occurs when variables are stored in both rows and columns.
Example: weather

weather <- read_csv("data/weather.csv")
weather[1:3, 1:15]
| id | year | month | element | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9 | d10 | d11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MX17004 | 2010 | 1 | tmax | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| MX17004 | 2010 | 1 | tmin | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| MX17004 | 2010 | 2 | tmax | NA | 27.3 | 24.1 | NA | NA | NA | NA | NA | NA | NA | 29.7 |
It has variables in individual columns (id, year, month), spread across columns (day: d1-d31) and across rows (element: tmin and tmax, the minimum and maximum temperature). To tidy this dataset, we first gather the day columns:
weather2 <- weather %>%
gather(key = day, value = value, d1:d31, na.rm = TRUE)
weather2 %>% head
| id | year | month | element | day | value |
|---|---|---|---|---|---|
| MX17004 | 2010 | 12 | tmax | d1 | 29.9 |
| MX17004 | 2010 | 12 | tmin | d1 | 13.8 |
| MX17004 | 2010 | 2 | tmax | d2 | 27.3 |
| MX17004 | 2010 | 2 | tmin | d2 | 14.4 |
| MX17004 | 2010 | 11 | tmax | d2 | 31.3 |
| MX17004 | 2010 | 11 | tmin | d2 | 16.3 |
At this point we remove the missing values with na.rm = TRUE.
This is ok because we know how many days are in each month and can easily reconstruct the explicit missing values.
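If we ever need the explicit NAs back, tidyr’s complete() can reconstruct them; a minimal sketch with a tiny made-up table:

```r
library(tidyr)
library(tibble)

df <- tibble(month = c(1, 1, 2), day = c(1, 2, 1), value = c(10, 12, 9))

# crossing month and day makes the implicit missing (month 2, day 2)
# explicit as an NA row
complete(df, month, day)
```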
We’ll also do a little cleaning:
weather3 <- weather2 %>%
mutate(day = parse_number(day)) %>%
select(id, year, month, day, element, value) %>%
arrange(id, year, month, day)
weather3 %>% head
| id | year | month | day | element | value |
|---|---|---|---|---|---|
| MX17004 | 2010 | 1 | 30 | tmax | 27.8 |
| MX17004 | 2010 | 1 | 30 | tmin | 14.5 |
| MX17004 | 2010 | 2 | 2 | tmax | 27.3 |
| MX17004 | 2010 | 2 | 2 | tmin | 14.4 |
| MX17004 | 2010 | 2 | 3 | tmax | 24.1 |
| MX17004 | 2010 | 2 | 3 | tmin | 14.4 |
The element column is not a variable; it stores the names of variables. To fix this, we spread the element and value columns back out into columns:
weather4 <- weather3 %>%
spread(key = element, value = value)
weather4 %>% head
| id | year | month | day | tmax | tmin |
|---|---|---|---|---|---|
| MX17004 | 2010 | 1 | 30 | 27.8 | 14.5 |
| MX17004 | 2010 | 2 | 2 | 27.3 | 14.4 |
| MX17004 | 2010 | 2 | 3 | 24.1 | 14.4 |
| MX17004 | 2010 | 2 | 11 | 29.7 | 13.4 |
| MX17004 | 2010 | 2 | 23 | 29.9 | 10.7 |
| MX17004 | 2010 | 3 | 5 | 32.1 | 14.2 |
This form is tidy: there’s one variable in each column, and each row represents one day.
Back to billboard. The billboard dataset actually contains observations on two types of observational units: the song and its rank in each week. This manifests itself through the duplication of facts about the song: artist, year and time are repeated many times.
billboard3 %>%
head| year | artist | track | time | week | rank | date |
|---|---|---|---|---|---|---|
| 2000 | 2 Pac | Baby Don’t Cry (Keep… | 04:22:00 | 1 | 87 | 2000-02-26 |
| 2000 | 2Ge+her | The Hardest Part Of … | 03:15:00 | 1 | 91 | 2000-09-02 |
| 2000 | 3 Doors Down | Kryptonite | 03:53:00 | 1 | 81 | 2000-04-08 |
| 2000 | 3 Doors Down | Loser | 04:24:00 | 1 | 76 | 2000-10-21 |
| 2000 | 504 Boyz | Wobble Wobble | 03:35:00 | 1 | 57 | 2000-04-15 |
| 2000 | 98^0 | Give Me Just One Nig… | 03:24:00 | 1 | 51 | 2000-08-19 |
This dataset needs to be broken down into two pieces:
- a song dataset which stores artist, song name and time
- and a ranking dataset which gives the rank of the song in each week.
We first extract a song dataset:
song <- billboard3 %>%
select(artist, track, year, time) %>%
unique() %>%
mutate(song_id = row_number())
song %>% head
| artist | track | year | time | song_id |
|---|---|---|---|---|
| 2 Pac | Baby Don’t Cry (Keep… | 2000 | 04:22:00 | 1 |
| 2Ge+her | The Hardest Part Of … | 2000 | 03:15:00 | 2 |
| 3 Doors Down | Kryptonite | 2000 | 03:53:00 | 3 |
| 3 Doors Down | Loser | 2000 | 04:24:00 | 4 |
| 504 Boyz | Wobble Wobble | 2000 | 03:35:00 | 5 |
| 98^0 | Give Me Just One Nig… | 2000 | 03:24:00 | 6 |
Then use that to make a rank dataset by replacing repeated song facts with a pointer to song details (a unique song id):
rank <- billboard3 %>%
left_join(song, c("artist", "track", "year", "time")) %>%
select(song_id, date, week, rank) %>%
arrange(song_id, date)
rank %>% head
| song_id | date | week | rank |
|---|---|---|---|
| 1 | 2000-02-26 | 1 | 87 |
| 1 | 2000-03-04 | 2 | 82 |
| 1 | 2000-03-11 | 3 | 72 |
| 1 | 2000-03-18 | 4 | 77 |
| 1 | 2000-03-25 | 5 | 87 |
| 1 | 2000-04-01 | 6 | 94 |
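When both units are needed together, the normalized tables can be re-joined on song_id; a minimal sketch with made-up rows:

```r
library(dplyr)
library(tibble)

song <- tibble(song_id = 1:2, artist = c("2 Pac", "2Ge+her"))
rank <- tibble(song_id = c(1, 1, 2), week = c(1, 2, 1), rank = c(87, 82, 91))

# joining restores the duplicated song facts only when we ask for them
left_join(rank, song, by = "song_id")
```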
We could also imagine a week dataset, which would record background information about each week, maybe the total number of songs sold or similar “demographic” information.

Finally, consider a single observational unit stored in multiple files. The recipe:

1. Read the files into a list of tables.
2. For each table, add a new column that records the original file name (the file name is often the value of an important variable).
3. Combine all tables into a single table.

The map_df() function from purrr makes this straightforward in R. First we find all files in a directory (data/) that match a regular expression (iris* - any file starting with “iris”), then read and combine them with map_df():

(paths <- dir("data", pattern = "iris*", full.names = TRUE))
#> [1] "data/iris1.txt" "data/iris2.txt" "data/iris3.txt"
map_df(paths, read_tsv) %>% head
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
For each file in paths we also want to record its name. We write a small function, read_files(): it will read the data in and then add a new column, file, with the name of the file. We use the basename() function to get the bare name of the file without the full path:
basename(paths)
#> [1] "iris1.txt" "iris2.txt" "iris3.txt"
read_files <- function(x) {
read_tsv(x) %>%
mutate(file=basename(x))
}
map_df(paths, read_files) %>% head
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | file |
|---|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa | iris1.txt |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa | iris1.txt |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa | iris1.txt |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa | iris1.txt |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa | iris1.txt |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa | iris1.txt |